CRUXEval-input: by examples


Not solved by any model

There are 19 examples that no model solved. If these are good problems, solving some of them is a useful signal that your model really is better than the leading models.
CRUXEval-input/112, CRUXEval-input/113, CRUXEval-input/128, CRUXEval-input/129, CRUXEval-input/177, CRUXEval-input/179, CRUXEval-input/185, CRUXEval-input/218, CRUXEval-input/220, CRUXEval-input/229, CRUXEval-input/236, CRUXEval-input/259, CRUXEval-input/413, CRUXEval-input/423, CRUXEval-input/444, CRUXEval-input/501, CRUXEval-input/545, CRUXEval-input/581, CRUXEval-input/729

Problems solved by 1 model only

example_link        model               min_elo
CRUXEval-input/294  gpt-4-0613+cot      1308.738
CRUXEval-input/754  gpt-4-0613+cot      1308.738
CRUXEval-input/232  gpt-4-0613+cot      1308.738
CRUXEval-input/647  gpt-4-0613+cot      1308.738
CRUXEval-input/250  gpt-4-0613+cot      1308.738
CRUXEval-input/391  gpt-4-0613+cot      1308.738
CRUXEval-input/474  gpt-4-0613          1237.380
CRUXEval-input/314  gpt-3.5-turbo-0613  1000.000
CRUXEval-input/770  phind                973.258
CRUXEval-input/119  mixtral-8x7b         868.657
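
As a rough illustration of how a min_elo column like the one above could be derived, the sketch below takes the lowest Elo among the models that solved each problem. The function name, the input layout, and the problem id "hypothetical/1" are assumptions made for this example, not the report's actual code; the Elo values are copied from the table.

```python
from typing import Dict, Optional, Set


def min_elo_to_solve(
    solved_by: Dict[str, Set[str]], elo: Dict[str, float]
) -> Dict[str, Optional[float]]:
    """Map each problem to the lowest Elo among models that solved it.

    Problems that no model solved map to None.
    """
    return {
        problem: min((elo[m] for m in models), default=None)
        for problem, models in solved_by.items()
    }


# Elo ratings taken from the table above.
elo = {"gpt-4-0613+cot": 1308.738, "gpt-4-0613": 1237.380}

# "hypothetical/1" is an invented problem solved by both models.
solved_by = {
    "CRUXEval-input/474": {"gpt-4-0613"},
    "CRUXEval-input/112": set(),
    "hypothetical/1": {"gpt-4-0613+cot", "gpt-4-0613"},
}

result = min_elo_to_solve(solved_by, elo)
```

Under this reading, a problem solved by exactly one model inherits that model's Elo, and an unsolved problem has no minimum Elo at all.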

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation. The correlation (tau) is negative for all of them, i.e. better models tend to do worse on these.

example_link        acc    tau
CRUXEval-input/660  0.771  -0.430
CRUXEval-input/531  0.657  -0.424
CRUXEval-input/373  0.629  -0.422
CRUXEval-input/242  0.886  -0.398
CRUXEval-input/222  0.714  -0.394
CRUXEval-input/233  0.571  -0.388
CRUXEval-input/28   0.829  -0.361
CRUXEval-input/65   0.743  -0.332
CRUXEval-input/199  0.571  -0.327
CRUXEval-input/598  0.857  -0.308
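
The page does not say how tau is computed. The sketch below assumes it is a Kendall-style rank correlation (tau-b, which handles ties) between each model's overall score and whether that model solved the problem. The function name and the sample data are hypothetical and only show why a negative value means stronger models fail more often.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: four models ordered by overall accuracy, and whether
# each one solved a given problem (1 = solved, 0 = failed).
overall_score = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]  # only the weaker models solve it

tau = kendall_tau_b(overall_score, solved)  # negative: a "suspect" pattern
```

A problem that only weaker models solve yields a negative tau, which is the pattern the table above flags.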

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.